Combining Bilingual and Comparable Corpora for Low Resource Machine Translation
Authors
Abstract
Statistical machine translation (SMT) performance suffers when models are trained on only small amounts of parallel data. The learned models typically have both low accuracy (incorrect translations and feature scores) and low coverage (high out-of-vocabulary rates). In this work, we use an additional data resource, comparable corpora, to improve both. Beginning with a small bitext and corresponding phrase-based SMT model, we improve coverage by using bilingual lexicon induction techniques to learn new translations from comparable corpora. Then, we supplement the model’s feature space with translation scores estimated over comparable corpora in order to improve accuracy. We observe improvements between 0.5 and 1.7 BLEU translating Tamil, Telugu, Bengali, Malayalam, Hindi, and Urdu into English.
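The abstract does not say which signals the lexicon induction step relies on, so the following is only a minimal sketch of one classic approach, contextual similarity: a source word's co-occurrence vector is projected through a small seed dictionary and compared with target-language vectors from a comparable corpus. The corpora, the seed dictionary, the invented source-language tokens, and all function names are hypothetical illustrations, not the authors' implementation.

```python
from collections import Counter
import math


def context_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` tokens of it."""
    vectors = {}
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            ctx = vectors.setdefault(word, Counter())
            for j in range(lo, hi):
                if j != i:
                    ctx[tokens[j]] += 1
    return vectors


def project(vector, seed_dict):
    """Translate the dimensions of a source context vector with the seed dictionary.

    Context words missing from the dictionary are dropped.
    """
    projected = Counter()
    for word, count in vector.items():
        if word in seed_dict:
            projected[seed_dict[word]] += count
    return projected


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_translations(src_word, src_vectors, tgt_vectors, seed_dict, top_k=3):
    """Rank target words as candidate translations of an out-of-vocabulary source word.

    Assumption for this sketch: target words already covered by the seed
    dictionary are excluded from the candidate list.
    """
    projected = project(src_vectors.get(src_word, Counter()), seed_dict)
    known_targets = set(seed_dict.values())
    scored = [
        (cosine(projected, vec), tgt)
        for tgt, vec in tgt_vectors.items()
        if tgt not in known_targets
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(tgt, score) for score, tgt in scored[:top_k]]


# Toy comparable corpora; the source-side tokens are invented placeholders.
SRC_SENTENCES = [["zu", "mika", "tor"], ["zu", "mika", "pel"], ["lan", "dova", "tor"]]
TGT_SENTENCES = [
    ["deep", "river", "water"],
    ["deep", "river", "fish"],
    ["tall", "mountain", "snow"],
]
# Hypothetical seed dictionary, standing in for entries learned from the small bitext.
SEED_DICT = {"zu": "deep", "tor": "water", "pel": "fish"}

if __name__ == "__main__":
    src_vecs = context_vectors(SRC_SENTENCES)
    tgt_vecs = context_vectors(TGT_SENTENCES)
    for candidate, score in rank_translations("mika", src_vecs, tgt_vecs, SEED_DICT):
        print(candidate, round(score, 3))
```

In a pipeline like the one the abstract describes, the top-ranked candidates could be added as new translations for out-of-vocabulary words, and a similarity score of this kind could be attached to existing phrase pairs as an extra feature. How the paper actually computes and combines such scores is not stated here.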
Similar resources
Neural Machine Translation for Low Resource Languages using Bilingual Lexicon Induced by Comparable Corpora
Automatically extracting parallel sentence pairs from the multilingual articles available on the Internet can address the data sparsity problem in building multilingual natural language processing applications, especially in machine translation. In this project, we have used an end-to-end siamese bidirectional recurrent neural network to generate parallel sentences from comparable multilingual ...
Extracting a Parallel Corpus from Comparable Documents to Improve Translation Quality in Machine Translation Systems
Data used for training statistical machine translation methods are usually prepared from three resources: parallel, non-parallel, and comparable text corpora. Parallel corpora are an ideal resource for translation, but because such texts are scarce, non-parallel and comparable corpora are also used for parallel text extraction. Most existing methods for exploiting comparable corpora loo...
An Efficient Framework for Extracting Parallel Sentences from Non-Parallel Corpora
Automatically building a large bilingual corpus that contains millions of words is always a challenging task. For low-resource languages in particular, it is difficult to find an existing parallel corpus that is large enough to build a real statistical machine translation system. However, comparable non-parallel corpora are richly available on the Internet, such as in Wikipedia...
Chinese-Portuguese Machine Translation: A Study on Building Parallel Corpora from Comparable Texts
Although there are increasing and significant ties between China and Portuguese-speaking countries, there are not many parallel corpora for the Chinese–Portuguese language pair. Both languages are widely spoken, with 1.2 billion native Chinese speakers and 279 million native Portuguese speakers. The language pair, however, could be considered low-resource in terms of available parallel corpora...
Harvesting Comparable Corpora and Mining Them for Equivalent Bilingual Sentences Using Statistical Classification and Analogy-Based Heuristics
Parallel sentences are a relatively scarce but extremely useful resource for many applications including cross-lingual retrieval and statistical machine translation. This research explores our new methodologies for mining such data from previously obtained comparable corpora. The task is highly practical since non-parallel multilingual data exist in far greater quantities than parallel corpora,...
Publication date: 2013